Visual scene understanding is an important capability that enables robots to purposefully act in their environment. In this paper, we propose a novel approach to object-class segmentation from multiple RGB-D views using deep learning. We train a deep neural network to predict object-class semantics that is consistent across several viewpoints in a semi-supervised way. At test time, the semantic predictions of our network can be fused more consistently into semantic keyframe maps than the predictions of a network trained on individual views. We base our network architecture on a recent single-view deep learning approach to RGB and depth fusion for semantic object-class segmentation and enhance it with multi-scale loss minimization. We obtain the camera trajectory using RGB-D SLAM and warp the predictions of RGB-D images into ground-truth annotated frames in order to enforce multi-view consistency during training. At test time, predictions from multiple views are fused into keyframes. We propose and analyze several methods for enforcing multi-view consistency during training and testing. We evaluate the benefit of multi-view consistency training and demonstrate that pooling of deep features and fusion over multiple views outperforms single-view baselines on the NYUDv2 benchmark for semantic segmentation. Our end-to-end trained network achieves state-of-the-art performance on the NYUDv2 dataset in single-view segmentation as well as multi-view semantic fusion.
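To make the warping and fusion step concrete, the following is a minimal sketch (not the authors' code) of how per-pixel class scores from a neighbouring RGB-D view could be warped into a keyframe using a known relative pose and then fused by averaging; all function and parameter names (warp_scores_to_keyframe, fuse_views, K, T_view_from_key) are illustrative assumptions, while the paper itself obtains poses from RGB-D SLAM and fuses network outputs in semantic keyframe maps.

```python
import numpy as np

def warp_scores_to_keyframe(view_scores, key_depth, K, T_view_from_key):
    """Warp a (H, W, C) class-score map from a neighbouring view into the keyframe.

    view_scores: per-pixel class scores predicted in the neighbouring view.
    key_depth:   (H, W) keyframe depth map in metres.
    K:           (3, 3) camera intrinsics (assumed shared by both views).
    T_view_from_key: (4, 4) rigid transform mapping keyframe coords to view coords.
    """
    H, W, C = view_scores.shape
    # Back-project every keyframe pixel to a 3D point in the keyframe camera frame.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = key_depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_key = np.stack([x, y, z, np.ones_like(z)], axis=-1)      # (H, W, 4)

    # Transform the points into the neighbouring view and project them.
    pts_view = pts_key @ T_view_from_key.T                       # (H, W, 4)
    z_view = pts_view[..., 2]
    u_view = K[0, 0] * pts_view[..., 0] / z_view + K[0, 2]
    v_view = K[1, 1] * pts_view[..., 1] / z_view + K[1, 2]

    # Nearest-neighbour lookup of the scores; mask pixels that fall outside
    # the neighbouring view or have invalid depth.
    ui = np.round(u_view).astype(int)
    vi = np.round(v_view).astype(int)
    valid = (z > 0) & (z_view > 0) & (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H)

    warped = np.zeros_like(view_scores)
    warped[valid] = view_scores[vi[valid], ui[valid]]
    return warped, valid

def fuse_views(warped_score_maps, valid_masks):
    """Average warped score maps per pixel over all views that observe it."""
    stacked = np.stack(warped_score_maps)                        # (V, H, W, C)
    counts = np.stack(valid_masks).sum(axis=0)[..., None]        # (H, W, 1)
    fused = stacked.sum(axis=0) / np.maximum(counts, 1)
    return fused.argmax(axis=-1)                                 # fused label map
```

The same warping mechanism can serve both purposes described in the abstract: during training, aligning predictions with ground-truth annotated frames to impose the consistency loss, and at test time, aggregating multi-view predictions into a semantic keyframe map.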